Outlier detection refers to the detection of unexpected situations in the data. Outliers may indicate fraud, hacking,
mislabeled data, or unusual behavior in the system, so it is important to identify them. In this study, the outlier
detection performance of several commonly used algorithms was calculated and compared on different types of data sets.
The results show that the algorithms achieved sufficient success. The highest performance was obtained by the
histogram-based outlier detection algorithm, with a ROC-AUC of 99 %.
1. Introduction


Outlier detection is used in many data mining applications [1]. An outlier is a data object that deviates significantly
from the other objects in the data set, as shown in Figure 1 [2]. In Figure 1, the region R does not follow the same
distribution as the other objects in the data set, so it can be defined as an outlier. Outliers arise from causes such
as data entry errors, changes in system behavior, fraud, forgery, installation errors, and human errors during data
collection [1]. Detecting them is of great importance for keeping a system secure and obtaining more accurate results,
and it matters in many applications such as fraud detection, intrusion detection, public safety, healthcare, damage
detection, and image processing [2].

Outliers are usually a small minority of the data set, which makes them difficult to detect. In addition, as the
dimensionality of the data set increases, the data become sparser, and neighborhood information is harder to capture
because distances and densities between data points become difficult to estimate [3]. In this study, experiments were
carried out on different data sets to analyze the efficiency of anomaly detection on multidimensional data. In
addition, the anomaly detection performance of the algorithms on time series [4], which are widely used in military,
economic and scientific fields, is examined on a multidimensional time series data set.

In this study, the Isolation Forest, CBLOF, LOF, k-nearest neighbor (KNN), angle-based and histogram-based outlier
detection algorithms, as well as CBLOF-BIRCH, in which the CBLOF and BIRCH algorithms are used together, are applied to
detect outliers on the breast-cancer, pendigits and SKAB datasets. These data sets and algorithms have been used in
many different ways for outlier detection in the literature. Anomaly detection studies on the breast-cancer and
pendigits datasets generally focus on the LOF and Isolation Forest algorithms [5] [6] [7] [8] [9]. Studies on the SKAB
dataset mostly use LSTM, AutoEncoder, convolutional neural network, RNN, CPDE, MSET and Isolation Forest methods [10]
[11] [12] [13] [14]. The LOF, Isolation Forest, KNN and CBLOF algorithms have frequently been used for anomaly
detection in application areas such as medicine and banking [5] [9] [15] [16] [17]. In addition, the LOF algorithm has
been reported to give successful results in intrusion detection [18]. The histogram-based method has also been applied
to anomaly detection on various time series data sets [15] [19] [20].
Figure 1. Region R is an outlier [2].


2. Related Work 


Several studies on outlier detection have been reported in the literature. Some of them are summarized in this
section.

In [23], outliers were detected using probabilistic, proximity-based and linear models in IoT-based structural health
monitoring systems. The authors used the shuttle data set from the UCI database.

In [24], random forest, KNN and support vector machine algorithms were compared with the authors' proposed method, the
hierarchical clustering-based support vector machine (HCSVM), for finding outliers in diabetes data obtained from a
hospital. The results showed that HCSVM performed the best outlier detection on the data set, which contains 4805
normal and 96 abnormal samples.

In [25], a new method, C-LSTM, which combines convolutional neural network (CNN) and long short-term memory (LSTM)
algorithms, was developed to detect abnormal values in Webscope S5, a web traffic dataset consisting of
one-dimensional time series signals. The developed method was used to detect the outliers in the dataset.

In [26], logistic regression, decision tree, k-nearest neighbor, random forest and autoencoder methods were used to
detect fraudulent behavior in credit card transactions.

In another study [27], fraud detection in online games and games of chance was investigated with clustering-based 
algorithms. 

In this study, multivariate and time series data sets of different sizes were used. The outlier detection success on
multidimensional data was examined for the angle-based algorithm, which is claimed to perform well on high-dimensional
data [21]; for the Isolation Forest, KNN, CBLOF, LOF and histogram-based algorithms, which are frequently used in
different outlier detection applications in the literature; and for the BIRCH algorithm, a good clustering method for
large databases [22], which was used together with CBLOF.


3. Methodology 


A. Dataset: 


In the study, three datasets were used: breast-cancer [6], which consists of digitized features of a breast mass;
pendigits [8], which consists of handwritten digit samples; and SKAB [10], which consists of time series data developed
for anomaly studies. The datasets were divided into 80 percent training and 20 percent test data. Table 1 shows the
number of data samples, the number of dimensions, the data types, and the number and percentage of outliers in each
dataset. The duration column of Table 1 shows the time period that the dataset covers.
Dataset Name    Number of   Number of    Attribute type         Number of   Percentage    Duration
                instances   dimensions                          outliers    of outliers
Breast-cancer   366         30           Real                   10          2.73 %        -
pendigits       6870        16           Numeric                156         2.27 %        -
SKAB            22473       10           Categorical, Numeric   7826        34.8 %        1 day (2020-03-


Table 1. Data properties. 
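
The outlier percentages in Table 1 follow directly from the outlier and instance counts. The short Python check below
recomputes them from the figures listed in the table; the names `datasets` and `outlier_percentage` are chosen here
for illustration only.

```python
# Outlier counts and instance counts as listed in Table 1.
datasets = {
    "breast-cancer": (10, 366),
    "pendigits": (156, 6870),
    "SKAB": (7826, 22473),
}


def outlier_percentage(n_outliers, n_instances):
    """Share of outliers in a data set, in percent."""
    return 100.0 * n_outliers / n_instances


percentages = {name: outlier_percentage(*counts) for name, counts in datasets.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f} %")
```

The two small datasets contain only a few percent of outliers, whereas roughly a third of the SKAB samples are labeled
as outliers.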


B. Methods: 
Angle-based Outlier Detection (ABOD): 


ABOD was developed for high-dimensional data. For each data point, the angles that it forms with the pairs of all other
points in the dataset are examined, and the variance of these angles is calculated. If the variance is lower than a
predetermined value, the point is considered an outlier [21].
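
The angle-variance idea can be sketched in a few lines of Python. This is a simplified, unweighted illustration, not
the implementation used in the study: the published ABOD method additionally weights the angles by the distances
between the points, and the function name `angle_variance` and the toy data are assumptions made here.

```python
import math
from itertools import combinations
from statistics import pvariance


def angle_variance(p, others):
    """Variance of the angles at p spanned by every pair of the other points."""
    angles = []
    for a, b in combinations(others, 2):
        va = [ai - pi for ai, pi in zip(a, p)]
        vb = [bi - pi for bi, pi in zip(b, p)]
        norm = math.hypot(*va) * math.hypot(*vb)
        if norm == 0:  # a or b coincides with p, so the angle is undefined
            continue
        cos = sum(x * y for x, y in zip(va, vb)) / norm
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return pvariance(angles)


# Toy data: five points close together and one point far away (hypothetical values).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
variances = [angle_variance(p, points[:i] + points[i + 1:]) for i, p in enumerate(points)]
most_outlying = min(range(len(points)), key=lambda i: variances[i])
```

Seen from the distant point, all other points lie in nearly the same direction, so its angles vary little and it
receives the smallest variance.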


Local Outlier Factor (LOF): 


For each object in the dataset, a local outlier factor (LOF) is determined, which represents its degree of
outlierness. The LOF of most objects deep inside a cluster is approximately 1, while lower and upper bounds can be
given for the other objects. The local density of each object is compared with the local densities of its neighbors,
and an object whose density is lower than that of its neighbors is considered an outlier. This method is related to
density-based clustering [28].
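
A minimal sketch of the LOF computation is given below, assuming Euclidean distance, no duplicate points, and a
brute-force neighbor search. The function names and the toy grid are illustrative and are not taken from the study.

```python
import math


def knn(points, i, k):
    """(distance, index) pairs of the k nearest neighbours of points[i]."""
    dists = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    return dists[:k]


def lof_scores(points, k=3):
    neigh = [knn(points, i, k) for i in range(len(points))]
    k_dist = [n[-1][0] for n in neigh]  # distance to the k-th neighbour
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for n in neigh:
        reach = [max(k_dist[j], d) for d, j in n]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours divided by the object's own density.
    return [sum(lrd[j] for _, j in n) / (len(n) * lrd[i]) for i, n in enumerate(neigh)]


# Toy data: a 3 x 3 grid of points plus one distant point (hypothetical values).
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
scores = lof_scores(grid + [(8.0, 8.0)], k=3)
```

The grid points obtain scores close to 1, while the distant point, whose density is much lower than that of its
neighbors, obtains a clearly larger score.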


Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH): 


BIRCH is a hierarchical clustering method. It is built on two concepts, the clustering feature (CF) and the clustering
feature tree (CF-tree). A CF-tree is initially created for clustering. Then, the entire dataset is scanned and the
clustering feature values are calculated by creating subsets with the predetermined N value [2].
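
In the general BIRCH formulation, a clustering feature summarizes a sub-cluster by three quantities: the number of
points, their linear sum, and their sum of squares. The sketch below illustrates this summary and its additivity. It
is an assumption-based illustration of the concept rather than the CF-tree construction itself, and the class and
attribute names are chosen here.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusteringFeature:
    n: int              # number of points in the sub-cluster
    linear_sum: list    # per-dimension sum of the points
    square_sum: float   # sum of the squared norms of the points

    @classmethod
    def from_point(cls, point):
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other):
        """CFs are additive, so two sub-clusters can be merged without the raw data."""
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root-mean-square distance of the members from the centroid."""
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.square_sum / self.n - c2, 0.0))


# Summarise the four corners of a square (hypothetical values).
cf = ClusteringFeature.from_point((0.0, 0.0))
for p in [(2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    cf = cf.merge(ClusteringFeature.from_point(p))
```

Because the centroid and radius can be derived from the summary alone, a single scan of the data is enough to maintain
them, which is what makes the approach suitable for large databases.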


Cluster-based Local Outlier Factor (CBLOF): 


CBLOF computes an outlier score based on the cluster-based local outlier factor [29]. In cluster-based outlier
detection, anomalous data arise in three ways: a data point may not belong to any cluster, it may lie far from its
closest cluster, or it may belong to a small or sparse cluster [2]. To find outliers, the CBLOF method uses the size of
the cluster to which a data point belongs and the distance of the point from the cluster center [30].
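
The following sketch shows one simplified way to combine cluster size and distance to the cluster center. It assumes
that cluster labels are already available (here they are assigned by hand), treats the biggest clusters that together
cover a share `alpha` of the data as large, and uses the unweighted distance as the score. The parameter name `alpha`
and the data are illustrative assumptions.

```python
import math
from collections import defaultdict


def cblof_scores(points, labels, alpha=0.9):
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    centers = {lab: [sum(c) / len(ps) for c in zip(*ps)] for lab, ps in clusters.items()}

    # Large clusters: the biggest clusters that together cover a share alpha of the data.
    large, covered = set(), 0
    for lab in sorted(clusters, key=lambda l: len(clusters[l]), reverse=True):
        if covered >= alpha * len(points):
            break
        large.add(lab)
        covered += len(clusters[lab])

    scores = []
    for p, lab in zip(points, labels):
        if lab in large:
            # Member of a large cluster: distance to its own center.
            scores.append(math.dist(p, centers[lab]))
        else:
            # Member of a small cluster: distance to the nearest large cluster center.
            scores.append(min(math.dist(p, centers[l]) for l in large))
    return scores


# Two clusters of five points each and one isolated point (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5),
          (20, 20)]
labels = [0] * 5 + [1] * 5 + [2]
scores = cblof_scores(points, labels)
```

In practice the labels would come from a clustering algorithm such as k-means or BIRCH; the isolated point forms a
small cluster of its own and therefore receives the largest score.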


Histogram Based Outlier Detection (HBOS): 


HBOS is an unsupervised method that calculates an outlier score by generating histograms [29], which are used to
separate normal data from outliers. The bin size of the histogram is of great importance in determining the anomalous
data correctly. If the bin size is too small, normal data may fall into sparse bins and be mistaken for outliers.
Conversely, if it is too large, outliers may fall into well-populated bins and appear as normal data. The method is
widely used, especially in fraud detection applications [30].
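
A minimal sketch of a histogram-based score is shown below, assuming equal-width bins, one histogram per feature, and
a score equal to the sum of the logarithms of the inverse normalized bin heights. The function name, the default
`n_bins` value and the data are assumptions for illustration.

```python
import math


def hbos_scores(rows, n_bins=5):
    scores = [0.0] * len(rows)
    for column in zip(*rows):  # one histogram per feature
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0  # guard against constant features
        bins = [min(int((v - lo) / width), n_bins - 1) for v in column]
        counts = [bins.count(b) for b in range(n_bins)]
        tallest = max(counts)
        for i, b in enumerate(bins):
            # log(1 / normalised height): rare bins contribute a large value.
            scores[i] += math.log(tallest / counts[b])
    return scores


# Five similar rows and one row far away in both features (hypothetical values).
rows = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (1.2, 2.2), (1.0, 2.0), (9.0, 9.5)]
scores = hbos_scores(rows, n_bins=5)
```

Changing `n_bins` changes which values share a bin, which is the sensitivity to bin size discussed above.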


K-Nearest Neighbors (KNN): 


The outlier score of each data point is calculated from its k nearest neighbors, for example as the distance to the
k-th nearest neighbor. The method thus uses neighborhood information to detect outliers [1].
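
As an illustration, the sketch below scores each point by the Euclidean distance to its k-th nearest neighbor, using a
brute-force search. The choice of this particular score and the value k = 2 are assumptions made for the example.

```python
import math


def knn_scores(points, k=2):
    """Outlier score of each point: distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points on a unit square and one distant point (hypothetical values).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (6.0, 6.0)]
scores = knn_scores(points, k=2)
```

The points on the square have two neighbors at distance 1, whereas the distant point has to reach much farther to find
its second neighbor.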
Isolation Forest:


Isolation Forest randomly selects a feature and isolates observations by randomly choosing a split value between the
minimum and maximum values of the selected feature [9]. Observations that can be isolated with fewer splits are more
likely to be outliers.
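
The random splitting can be sketched as follows. This simplified illustration follows a single point through repeated
random splits and averages the number of splits over many repetitions. It omits the subsampling and score
normalization of the full Isolation Forest algorithm, and the function names, the depth limit and the data are
assumptions.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=10):
    """Number of random splits applied until `point` is isolated within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:  # the chosen feature cannot separate the remaining rows
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as the point.
    same_side = [row for row in data if (row[feature] < split) == (point[feature] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Eight points inside the unit square and one distant point (hypothetical values).
data = [(0.1, 0.2), (0.3, 0.9), (0.5, 0.5), (0.7, 0.1), (0.9, 0.6),
        (0.2, 0.7), (0.6, 0.8), (0.8, 0.3), (10.0, 10.0)]
inlier_length = mean_path_length((0.5, 0.5), data)
outlier_length = mean_path_length((10.0, 10.0), data)
```

On average the distant point is separated from the rest after far fewer splits than the central point, which is the
property the method exploits.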


C. Tools: 


In this research, the PyOD library for Python was used to detect outliers.


D. Evaluation: 


ROC analysis was developed to examine the performance of classification systems, and ROC charts are used to show
classification performance. In the ROC graph, the x-axis represents the false positive rate, that is, the proportion of
normal data incorrectly classified as outliers, and the y-axis represents the true positive rate, that is, the
proportion of outliers correctly classified as outliers. The area under the ROC curve is called the AUC. The larger
this area, the more successful the algorithm is considered [31]. The ROC-AUC value was taken as the criterion for
analyzing the success of the algorithms used in this study.
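
To make the metric concrete, the sketch below builds the ROC points by lowering the decision threshold one sample at a
time and integrates the curve with the trapezoidal rule. It assumes binary labels (1 for outlier, 0 for normal) and
distinct scores; the function names and the example values are illustrative.

```python
def roc_curve_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1  # an outlier is flagged correctly
        else:
            fp += 1  # a normal sample is flagged by mistake
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical outlier scores for four normal samples (0) and two outliers (1).
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
roc_auc = auc(roc_curve_points(labels, scores))
print(roc_auc)  # 0.875
```

In this example, seven of the eight outlier-normal pairs are ranked correctly, which corresponds to an AUC of 0.875.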


E. Results: 


Abnormal and normal data samples in the datasets used in this study are shown in three dimensions in Figure 2. In the
pendigits and breast-cancer datasets, the abnormal data lie farther from the distribution of the normal data. In the
time series SKAB dataset, on the other hand, the outliers occur within the usual time period.
(c) Time-dependent variation of the feature determined in the SKAB dataset. Outliers are
represented by yellow points.


Figure 2. The distribution of outlier and normal data in datasets. 
Figure 3. The AUC distribution of algorithms applied to different datasets.


The performance of each algorithm on the SKAB, breast-cancer and pendigits datasets is shown in Figure 3 as ROC-AUC
values. In the experiments, the HBOS algorithm performed better than the other algorithms in detecting outliers on all
data sets used. HBOS showed its best performance on the breast-cancer dataset, with a ROC-AUC of 99 %. The lowest AUC
values were observed on SKAB, which is a multivariate time series dataset. A closer look at the experimental results
shows that, although the LOF algorithm reaches a high ROC-AUC of 96 % on the breast-cancer dataset, it gives the lowest
performance of the study on the other two datasets, pendigits and SKAB. All results of the study are presented in
Table 2.
Algorithm           pendigits   breast-cancer
ABOD
CBLOF
CBLOF-BIRCH
LOF
HBOS
Isolation Forest
KNN


Table 2. The success of the algorithms on different datasets. The performance values shown represent 
ROC-AUC values. 


4. Conclusion 


With the widespread use of the Internet, outlier detection has become even more important for preventing forgery and
fraud. In this study, the performance of outlier detection algorithms on different types of data sets was compared.
The histogram-based algorithm showed the highest success on the datasets used, and the algorithms generally had a lower
success rate on the SKAB dataset than on the other datasets. In future studies, outlier analysis will be carried out on
image data.